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ROBUST PARAMETRIC CLASSIFICATION AND 
VARIABLE SELECTION BY A MINIMUM DISTANCE 

CRITERION 

By Eric C. Chi* and David W. Scott"!" 
University of California, Los Angeles and Rice University 

We investigate a robust penalized logistic regression algorithm 
based on a minimum distance criterion. Influential outliers are often 
associated with the explosion of parameter vector estimates, but in 
(~| the context of standard logistic regression, the bias due to outliers 

always causes the parameter vector to implode, that is shrink towards 
' ' the zero vector. Thus, using LASSO-like penalties to perform variable 

0^ selection in the presence of outliers can result in missed detections of 

relevant covariates. We show that by choosing a minimum distance 

criterion together with an Elastic Net penalty, we can simultaneously 

find a parsimonious model and avoid estimation implosion even in the 
presence of many outliers in the important small n large p situation. 
Implementation using an MM algorithm is described and performance 
evaluated. 



I— ' 1. Introduction. Regression, classification and variable selection prob- 

lems in high dimensional data are becoming routine in fields ranging from 
J> finance to genomics. In the latter case, technologies such as expression arrays 

have made it possible to comprehensively query a patient's transcriptional 
activity at a cellular level. Patterns in these profiles can help refine sub- 
types of a disease according to sensitivity to treatment options or identify 
previously unknown genetic components of a disease's pathogenesis. 

The immediate statistical challenge is finding those patterns when the 
I number of predictors far exceeds the number of samples. To that end the 

Least Absolute Shrinkage and Selection Operator (LASSO) has been quite 
• ^ successful at addressing "the small n, big p problem" (Chen, Donoho and 

^ Saunders, 1998; Tibshirani, 1996). Indeed, £i-penalized maximum likelihood 

model fitting has inspired many related approaches that simultaneously do 
model fitting and variable selection. These approaches have been extended 
from linear regression to generalized linear models. In particular logistic 
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regression fit using the logistic deviance loss with an Elastic Net penalty 
(Zou and Hastie, 2005) has been well studied. Friedman, Hastie and Tibshi- 
rani (2010) provide an R package glmnet that performs Elastic Net penal- 
ized maximum likelihood estimation for generalized linear models. Genkin, 
Lewis and Madigan (2007) worked out a Bayesian formulation employing the 
Laplace prior and applied their model to text classification. Wu et al. (2009) 
proposed minimizing a LASSO penalized logistic likelihood for genome wide 
association studies Finally, Liu et al. (2007) presented algorithms for both 
Elastic Net and concave Bridge penalized logistic likelihood models. 

Nonetheless while -penalized maximum likelihood methods have proved 
their worth at recovering parsimonious models, noticeably less attention has 
been given to extending these methods to handle outliers in high dimensional 
data. For example in biological data, tissue samples may be mislabeled or be 
contaminated. Existing work centers around the Huber loss function (Owen, 
2006; Rosset and Zhu, 2007). Rosset and Zhu (2007) and Wang, Zhu and 
Zou (2008) discuss using a Huberized hinge loss for regularized robust clas- 
sification. In these references, regularized robust estimation procedures are 
introduced, but compelling scenarios which justify their use are not explored. 

It is not surprising that outliers may bias estimation. What is less well 
known is that outliers can strongly influence variable selection. In this paper 
we identify some circumstances that motivate robust variants of penalized 
estimation and develop a minimum distance estimator for logistic regression. 
To address the n <^ p scenario when predictors are correlated we add the 
Elastic Net penalty. We evaluate the performance of our approach through 
simulated and real data. 

Robust methods of logistic regression are not new in the classic n > p 
case. A broad class of solutions consists of downweighting the contribution 
of outlying points to the estimating equations. Downweighting can be based 
on extreme values in covariate space (Carroll and Pederson, 1993; Kunsch, 
Stefanski and Carroll, 1989) or on extreme predicted probabilities (Bianco 
and Yohai, 1996; Carroll and Pederson, 1993; Copas, 1988). 

An alternative approach is to use minimum distance estimation (Donoho 
and Liu, 1988). The minimum distance estimator used in this paper can also 
be seen as a method that downweights the contributions of outliers (Chi, 
2011). The work in Bondell (2005) is similar to ours in that he considered 
fitting parameters by minimizing a weighted Cramer- von Mises distance. 
The difference between the approach proposed here and prior work is the 
application of regularization to handle high dimensional data and perform 
variable selection in the presence of outliers. Moreover, the robust loss func- 
tion we propose has a particularly simple form which, when combined with 
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the Elastic Net penalty, can be solved very efficiently for large problems 
by minimizing a series of penalized least squares problems with coordinate 
descent. 

The rest of this paper is organized as follows. We briefly explain the nota- 
tion and conventions used in this paper in Section 2. In Section 3 we review 
maximum likelihood estimation (MLE) of the logistic regression model and 
demonstrate the potentially deleterious effects of outliers on variable selec- 
tion with the £i-penalized MLE. We introduce our robust loss function in 
Section 4. In Section 5 we describe algorithms for fitting our robust logistic 
regression model. We describe how to select the regularization parameters in 
Section 6. In Sections 7 and 8 we present results on real and simulated data. 
Section 9 concludes with a summary of our work and also future directions. 

2. Notation and conventions. In this article we will use the follow- 
ing notation and conventions. Vectors are denoted by boldface lowercase 
letters, e.g., a. The ith entry of a vector a is denoted Oj. All vectors are 
column vectors. Matrices are denoted by boldface capital letters, e.g., A. 
The element of a matrix A is denoted by aij. The transpose of the ith 
row of a matrix A is aj. We denote the jth column of A with a^). Random 
variables are denoted by capital letters. The 1-norm and 2-norm of a vector 
a are denoted ||a||i and ||a||2, respectively. If / is a univariate function then 
/(a) and /(A) should be interpreted as being evaluated element-wise. The 
kth element in a sequence is denoted by a superscript in parenthesis, e.g., 
a^*^) denotes the kth vector in a sequence of vectors. 

Throughout this article y G {0, 1}" denotes a binary response vector and 
X G ]R"^P denotes a matrix of covariates. We will assume that the columns 
of X are centered. We utilize the generalized linear model framework of 
modeling the mean response as a function of affine combination of the co- 
variates /3o + X/3 where /So G M and /3 G W (McCullagh and Nelder, 1989). 
We will often employ the compact notations X = (1,X) G M"'^(p+^) and 
= (/3o,/3T)T GRP+i. 

3. Standard logistic regression and implosion breakdown. In bi- 
nary regression, we seek to predict or explain an observed response y G 
{0, 1}" using predictors X G MJ^^P, where the n <^ p may be expected. In 
typical expression microarray data we encounter n ~ 100 and p ~ 10^, while 
with single nucleotide polymorphism (SNP) array data both n and p may 
be larger by a factor of 10. Let the conditional probabilities be given by 
P{Yi = l\Xi = Xj) = F{5tje) where F{u) = 1/(1 + exp(-M)). Then un- 
der this assumption, in standard logistic regression (McCullagh and Nelder, 
1989) we minimize the negative log-likelihood of a linear summary of the 
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predictors, 

(3.1) yTX6>-l"riog(l + exp(X6>)). 

A simple univariate example illustrates the bias outliers introduce into this 
estimation procedure. In Figure 1(a) we see that the addition of 5 and 10 
outliers among the controls shrinks towards zero. In fact, Croux, Flandre 
and Haesbroeck (2002) showed that with p covariates only 2p such outliers 
are required to make ||/3||2 < e for any desired e. Our robust estimator, which 
we introduce in the next section, produces virtually the same curves shown 
in Figure 2(a). 

The significance of this "implosion" breakdown phenomenon is that it has 
implications for LASSO based variable selection. Consider what happens 
when we add 999 noise covariates which are independent of the class labels 
to the scenario depicted in Figure 1(a) and perform ^i-penalized logistic 
regression. Figure 1(b) shows the corresponding regularization paths or the 
values of the fitted regression coefficients as a function of the penalization 
parameter. As outliers are added the regularization path for the relevant 
covariate Xi quickly falls into the noise. 

The LASSO performs continuous variable selection by shrinking regres- 
sion coefficients of covariates with very low correlation with the responses 
to zero. If outliers are present in relevant covariates, then the combination 
of implosion breakdown and soft-thresholding by the LASSO can lead to 
missed detection of relevant covariates. In contrast in Figure 2(b) we see 
that the corresponding regularization paths obtained using our robust es- 
timator are insensitive to outliers and so relevant covariates still have the 
chance of being selected. The message is clear; penalized robust estimation 
procedures have practical merit. In the next section we describe our robust 
estimator. 

4. The Minimum Distance Estimator. Let Pg be a probability 
mass function (PMF), specified by a parameter 6 ^ Q CW for some p GN, 
believed to be generating data Yi, . . . ,Yn that take on values in the discrete 
set X- Let P be the unknown true PMF generating the data. If we actually 
knew the true distribution, an intuitively good solution is the one that is 
"closest" to the true distribution. Consequently, as an alternative to using 
the negative log-likelihood, we consider the L2 distance between Pg and P. 
Thus, we pose the following variational optimization problem; we seek 6 G Q 
that minimizes 



(4.1) 



Y,[Pg{y) - P{y)f 
y&x 
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(a) Univariate regression onto XiiThe true logistic curve is black and the MLE is 
orange. The number of outliers (0, 5, 10) increase from the left-most to right-most 
panel. 




(b) Regularization paths: The path for the relevant regression coefficient /3i is blue. 
The number of outliers (0, 5, 10) increase from the left-most to right-most panel; 
999 irrelevant covariates have been added. 



Fig 1. MLE logistic regression with 100 cases and 100 controls. 
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(a) Univariate regression onto Xi:The true logistic curve is black and the L2E is 
orange. The number of outliers (0, 5, 10) increase from the left-most to right-most 
panel. 




(b) Regularization paths: The path for the relevant regression coefficient /3i is blue. 
The number of outliers (0, 5, 10) increase from the left-most to right-most panel; 
999 irrelevant covariates have been added. 



Fig 2. L2E logistic regression with 100 cases and 100 controls. 
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Although finding such a is impossible since P is unknown, it is possible 
to find a that minimizes an unbiased estimate of this distance. Expanding 
the sum in (4.1) gives us 

5^ Pe{y)' - 2 Pe{y)P{y) + P{y)'. 

j/Gx y&x y&x 

The second summation is an expectation E[Pg{Y)] where y is a random 
variable drawn from P. This summation can be estimated from the data by 
the sample mean. The third summation does not depend on 0. With these 
observations in mind, we use the following loss function 

L(0) = ^P,(y)2-^f;p,(y,) 

y&X i=l 

and seek a 9 such that L{6) = min^ge L{0). The estimate is called an L2 
estimate or L2E in Scott (2001). 

Note that the minimization problem is a familiar one associated with 
bandwidth selection for histograms and more generally for kernel density 
estimators (Scott, 1992). Applying a commonly used criterion in nonpara- 
metric density estimation to parametric estimation has the interesting conse- 
quence of trading off efficiency with robustness in the estimation procedure. 
In fact, previously Basu et al. (1998) have described a family of divergences 
which includes the L2E as a special case and the MLE as a limiting case. 
The members of this family of divergences are indexed by a parameter that 
explicitly trades off efficiency for robustness. The MLE is the most efficient 
but least robust member in this family of estimation procedures. The L2E 
represents a reasonable tradeoff between efficiency and robustness. Scott 
(2001, 2004) demonstrated that the L2E has two benefits, the aforemen- 
tioned robustness properties and computational tractability. The tradeoff 
in asymptotic efficiency is similar to that seen in comparing the mean and 
median as a location estimator. Indeed, while other members in this family 
may possess a better tradeoff, the L2E has the advantage of admitting a 
simple and fast computational solution as we will show in Section 5. 

We now show that the L2E method applied to logistic regression amounts 
to solving a non-linear least squares problem. We seek to minimize a surro- 
gate measure of the L2 distance between the logistic conditional probability 
and the conditional probability generating the data. 

If the Xj are unique, then we seek a 6 that "jointly" minimizes n distinct 
L2 distances 

(4.2) [Pe{yi = y\X = ^if-2yiPe{Yi=yi\X = ^i)] 

j/e{o,i} 
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where PeiY = 1\X = Xj) = F(K6). A sensible approach is to minimize a 
weighted average of the n distinct L2 distances 

n 

Y,Wi [Pe{Yi = y\X = ^,f -2y,Pe{Yi = yi\X = x,)]. 

i=i ye{0,i} 

where the weights Wi are non-negative. Up to an additive constant that does 
not depend on 6 the criterion in (4.2) can be compactly written as 

||y - F{±e)\\l, + 11(1 - y) - (1 - FiXeml, = 2||y - F{xe)f^ 

where W = diagjtDi, . . . , Wn} and ||u|| a denotes the semi-norm V u^Au for 
A positive-semidefinite. 

While it may be worth the effort to carefully choose the weights Wi in some 
cases, we will show that in the examples to follow that taking Wi = 1/n for 
i = 1, . . . ,n often works well. Thus, we will seek to minimize 

L(y,X0)= ^||y-F(X0)||2. 

n 

Remarkably, minimizing this unassuming loss function produces robust lo- 
gistic regression coefficients. 

We note that the L2 criterion has been used before for classification prob- 
lems. Kim and Scott (2009, 2010) used the L2 criterion to perform classifi- 
cation using kernel density estimates. Their application of the L2 criterion, 
however, is more in line with its customary use in nonparametric density 
estimation whereas we use it to robustly fit a parametric model. 

5. Estimation with convex quadratic majorizations. We now de- 
rive an algorithm for finding the logistic L2E solution by minimizing a series 
of convex quadratic losses. The logistic L2E loss is not convex. While there is 
no shortage of strategies for solving general unconstrained smooth non-linear 
minimization problems (Nocedal and Wright, 2006), we opt to minimize the 
L2E loss with a majorization-minimization (MM) algorithm (Hunter and 
Lange, 2004; Lange, Hunter and Yang, 2000) because it is numerically sta- 
ble and easy to implement. Most importantly our MM algorithm is also 
easily adapted to handle LASSO-like penalties. 

5.1. Majorization- Minimization. The strategy behind MM algorithms is 
to minimize a surrogate function, the majorization, instead of the original 
objective function. The surrogate is chosen with two goals in mind. First, an 
argument that decreases the surrogate should decrease the objective func- 
tion. Second, the surrogate should be easier to minimize than the objective 
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function. Formally stated, a real-valued function h majorizes a real-valued 
function 5 at v if h{u) > g{u) for all u and h{v) = g{v). 

In words, the surface h lies above the surface g and is tangent to g at v. 
Given a procedure for constructing a majorization, we can define the MM 
algorithm to find a minimizer of a function g as follows. Let v^'^) denote the 
A;th iterate: (1) find a majorization /i(v; v^'^)) of g at v^'^'); (2) set v^'^'^"'^) = 
argmiuv /i(v; v^'^^); and (3) repeat until convergence. This algorithm always 
takes non-increasing steps with respect to g. Consider the iteration starting 
at v^'^). Since v^*^"'"-'^) minimizes /i(v; v^'^')), we have 

g{^(^)) = /^(vW; v^^)) > /.(v^'^+i); v^'^)) > g{v^''+'^). 

By using the MM algorithm, we can convert a hard optimization prob- 
lem (e.g., non-convex, non-differentiable) into a series of simpler ones (e.g., 
smooth convex), each of which is easier to minimize than the original. 

5.2. Majorizing the logistic L2E loss. Recall that the minimization prob- 
lem at hand is 

9 = argminL(y, X0). 

We use the following convex quadratic majorization of L(y, X0) to find 9. 
Theorem 5.1. The following function majorizes L{y, ^9) at 9: 

(5.1) Li9-9) = L{y,X9) + -zlX{9-9)+'l\\X{9-9)\\l 

n " n 

where Zg = 2G[F{±9) - y], G = diag{F(x7^)> • • • > ^(^I^)}, and r] > is 
sufficiently large. 

Thus using the majorization (5.1) in an MM algorithm results in iterative 
least squares. A proof of Theorem 5.1 given in the Appendix A.l. We are able 
to find a simple convex quadratic majorization since the logistic L2E loss 
has bounded curvature. A sharp lower bound on rj is given by the maximum 
curvature of the logistic L2E loss over all parameter values. The bound is 
derived in the Appendix. The practical implication is that the parameter 
r/~^ controls the step size of our iterative solver. Consequently, in practice 
we set r] to its lower bound to take the largest steps possible to speed up 
convergence. 



10 



E. CHI ET AL. 



5.3. Iterative Least Squares. We can express the majorization L{d,6) 
(5.1) as 

Lie, 6) = r?(/3o - /3o - -Zgf + -||C(^) - ^PWl + ^(0), 
f] n 



where 



IX 

Zq = U 1 z^, 



C(0)=X^-r/-i(zg-z^l), 

and K{9) is a constant that does not depend on 0. Since L{0, 6) is separable 
in /3o and (3, we can minimize L{6, 6) with respect to each of them individ- 
uahy. Moreover, when X is fuh rank as is often the case when n > p, then 
the solution to the normal equations is unique and the parameter updates 
are given by 

^^■^^ = - i (X^x) ' X^z^c™, . 

In the taxonomy of optimization procedures, the updates (5.2) constitute 
a Quasi-Newton method with exact line search. Note that the descent direc- 
tion has a simple update since the Hessian approximation is computed only 
once for all iterations. 

5.4. Regularization. The majorization given in Theorem 5.1 can be adapted 
for regularization. It follows immediately that {1/2)L{6;9) + AJ(/3) ma- 
jorizes (l/2)L(y,X0) + AJ(/3) for a penalty function J : RP M+ and 
regularization parameter A > 0. Regularization is useful for stabilizing esti- 
mation procedures. For example if X is not full rank or has a large condition 
number we can add a ridge penalty. We then seek the minimizer to the fol- 
lowing problem 



nnn -^\\y - Fi±e)\\l + xh(3\\l 
eeRp+^ In I 



which we can solve by minimizing the majorization L{d,6) + A||/3||2. Since 
the intercept is not penalized, the intercept updates are the same as in (5.2). 
The update for /3 becomes 

(5.3) /3("^+i) = /3("^) - ^(X"^X + AI)-ix"rz^(™). 
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Under suitable regularity conditions the MM algorithm for solving the 
ridge penalized logistic L2E problem is guaranteed to converge to a sta- 
tionary point of -L(y,X0) + A||/3||2- This follows from global convergence 
properties of MM algorithms that involve continuously differentiable objec- 
tive and majorization functions (Lange, 2010). On the other hand, the MM 
algorithm for the unregularized version of the problem is not guaranteed to 
converge based on the sufficient conditions given in Lange (2010) because 
the objective function is not coercive (i.e., not all its level sets are compact) 
and the quadratic majorization is not strictly convex in 6 unless X is full 
rank. Adding the ridge penalty remedies both situations and sufficient con- 
ditions for global convergence are met. In our experience, however, when 
n > p the iteration rules given in (5.2) do not appear to have convergence 
issues. 

Another reason to consider regularization is to perform continuous vari- 
able selection via a LASSO-like penalty. In particular, consider the penalized 
majorizer for the L2E loss regularized by the Elastic Net penalty. 



(5.4) -me) + A {a\mi + '-^\m2 

Since our work is motivated by genomic data which is known to have cor- 
related covariates, we will focus on the Elastic Net penalty because it pro- 
duces sparse models but includes and excludes groups of correlated variables 
(Zou and Hastie, 2005). The LASSO, in contrast, tends to select one covari- 
ate among a group correlated covariates and exclude the rest. If groupings 
among the covariates are known in advance, a group LASSO penalty could 
be used (Yuan and Lin, 2006). The Elastic Net penalty is useful in that it 
performs group selection without prespecification of the groups. 

We are interested in generating MM iterates 0^™^ = (/3q™\/3^'"M where 



(5.5) 



o{m+l) _ aim) -\ 



= arg^in f ||C(0('")) - X/3||2 + A (ami + ^^^11/311^) • 

Before discussing how to practically solve the surrogate minimization 
problem, note that regardless of how (5.5) is solved, we have the follow- 
ing guarantee on the convergence of the MM iterates. 



Theorem 5.2. Under suitable regularity conditions for any starting point 
6^'^\ the sequence of iterates 0^^\6^'^\... generated by (5.5) converges to a 
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stationary point of 

^\\y- F{±e)g + A (^a||/3||i + ^i^||/3||i 

where A > and a G [0, 1). 

A proof is given in the Appendix A. 2 and relies on a recent extension of the 
global convergence properties of MM algorithms for locally Lipschitz con- 
tinuous objective and majorization functions (Schifano, Strawderman and 
Wells, 2010). Note that Theorem 5.2 restricts a < 1, i.e., algorithmic con- 
vergence of the LASSO regularized logistic L2E is not guaranteed. This 
condition is imposed to ensure that the majorization is strictly convex in f3. 
Again in our experience, the LASSO regularized logistic L2E does not have 
algorithmic convergence issues. 

As a final remark on algorithmic convergence note that since the ridge 
penalty is a special case of the Elastic Net, Theorem 5.2 implies that ridge 
penalized logistic L2E (5.3) will also converge. 

5.5. Coordinate Descent. The optimization problem for updating (3 in 
(5.5) can be written as an ^i-penalized least squares problem. 



/3("^+i) =argmin^ 
2n 



; WX{l-a)I 



(3 



2 



+ Xa\\(3\\i- 

2 



Thus, minimizing the Elastic Net penalized logistic L2E loss can be cast 
into a series of LASSO regressions. We choose to solve (5.5) with coordinate 
descent which has been shown to efficiently solve penalized regression prob- 
lems when selecting relatively few groups of correlated predictors (Friedman 
et al., 2007; Wu and Lange, 2008). 

Coordinate descent is a special case of block relaxation optimization 
where, in a round-robin fashion, we optimize the objective function with 
respect to each coordinate at a time while holding all other coordinates 
fixed. Formally, if we are minimizing a multivariate function / at the A;th 
round of coordinate descent for the jth coordinate we solve 

(5.6) G argmin /(<), . . . , e\%e, ef^,^\ . . . , Of-^)). 



The jth coordinate update during the kt\i round of cyclic coordinate de- 
scent of the mth MM iteration is well defined, i.e., there exists a unique 
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Algorithm 1 Iterative L2E solver 

6 initial guess 
repeat 

p ^ F(X6») 

G ^ diag{p * (1 - p)} 

z^2G(p-y) 

C^X/3-i(z-^l) 

Po A) - ^"^^ 

repeat 

for k = l..p do 

r ^ C - (X/3 - /3feXfc) 

^5(^xIr,Aa)/[^|lx,||i + A(l-a)] 
end for 
until convergence 
until convergence 
return 9 



minimizer in (5.6). The update ^{"^'^^ has a simple form (Donoho and John- 
stone, 1995) and is given by the subgradient equations to be 

SK-)lli + A(l-a)' 
where r^™''^'-'^ is a vector of partial residuals with ith entry 



{■m,k,j) _ A,(a{m) 



j'=i j'=j+i 



and S is the soft-threshold function 



S{a, A) = sign(a) max(|a| — A, 0). 

Algorithm 1 gives pseudocode for the resulting iterative solver. The sym- 
bol * denotes the Hadamard element-wise product. In practice we also use 
active sets to speed up computations. That is, for a given initial (3, we only 
update the non-zero coordinates of f3, the active set, until there is little 
change in the active set parameter estimates. The non-active set parameter 
estimates are then updated once. If they remain zero, the Karush-Kuhn- 
Tucker (KKT) conditions have been met and a global minimum of (5.5) has 
been found. If not, then the active set is expanded to include the coordinates 
whose KKT conditions have been violated and the process is repeated. 

6. Choosing the penalty parameters. 
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6.1. Warm Starts and Calculating Regularization Paths. We will need to 
compare the regression coefficients obtained at many values of the penalty 
parameter A to perform model selection. Typically we can rapidly calculate 
regression coefficients for a decreasing sequence of values of A through warm 
starts. For A sufficiently large, only the intercept term will come into the 
model. The smallest A* such that all regression coefficients are shrunk to 
zero is given by 

2 

(6.1) A* = — y{l -y) max |xT.xy|. 

na j=i,...,p 

We compute a grid of A values equally spaced on a log scale between Amax = 
A* and Amin = eAmax where e < 1. In practice, we have found the choice of 
e = 0.05 to be useful. In general, we are not interested in making A so small 
as to include all variables. 

Moreover, due to the possible multi-modality of the L2E loss, we recom- 
mend computing the regulation paths starting from a smaller regularization 
parameter and increasing the parameter value until Amax- Since we face 
multi-modality initial starting points can make a significant difference in 
the answers obtained. 

6.2. The heuristic for choosing starting values. Since the logistic L2E 
loss is not convex, it may have multiple local minima. For the purely lasso- 
penalized problem, the KKT condition at a local minimum is 

|xj;.)G(y-F(/3ol + X/3))| < A. 

Equality is met whenever j3j 7^ 0. Thus, the largest values of 

|x[.)G(y-F(/3ol + X/3))| 

will correspond to a set of covariates which include covariates with non-zero 
regression coefficients. The leap of faith is that the largest values of 

|x[.)G(y-F(/3ol + X/3))| 

evaluated at the null model will also correspond to a set of covariates which 
include covariates with non-zero regression coefficients. This idea has been 
used in a "swindle" rule (Wu et al., 2009), SAFE rules for discarding vari- 
ables (El Ghaoui, Viallon and Rabbani, 2010), and STRONG rules for dis- 
carding variables (Tibshirani et al., 2011). In those instances the goal is to 
solve a smaller optimization problem. In contrast, we initialize starting pa- 
rameter entries to zero rather than excluding variables with low scores from 
the optimization problem. 
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Calculate the following scores Zj 

Zj = |xJ)Go(y -pl))|, 

where p = y the sample mean of y and Gq = p(l — 

Let S = {j : Zj is "large" } . Assign the starting parameter value as follows 

log(y/(l - I/)) 
1 otherwise. 

6.3. Robust Cross- Validation. Once we have a set of models computed at 
different regularization parameter values, we select the model that is optimal 
with respect to some criterion. We use the following robust 10-fold cross- 
validation scheme to select the model. After partitioning the data into 10 
training and test sets, for each i = 1, . . . , 10 folds we compute regression 
coefficients *(A) for a sequence of A's between Amax and Amin holding out 
the ith test set Si. 

Next we refit the model using the reduced variable set Sf, those with 
nonzero regression coefficients, and refit using logistic L2E with a = 0. This 
refitting produces less biased estimates. We are adopting the same strat- 
egy as LARS-OLS in Efron et al. (2004). Our framework, however, could 
adopt a more sophisticated strategy along the lines of the Relaxed LASSO 
in Meinshausen (2007). Henceforth let (A) denote the regression coeffi- 
cients obtained after the second step. Let d~*(A) denote the contribution of 

observation j to the L2E loss under the model 9 (A), i.e., 

dj\x) = {yj-F{sqe'\x))y. 

We use the following criterion to choose A*: 

A* = argmin medianj=i^..,^io median^g^- d~\X). 

X 

The reason for choosing A in this way is due to a feature of the robust 
fitting procedure. Good robust models will assign unusually large values of 
d~^{X) to outliers. Thus, the total L2E loss is an inappropriate measure of 
the prediction error if influential outliers were present. On the other hand, 
taking the median, for example, would provide a more unbiased measure of 
the prediction error regardless of outliers. The final model selected would be 
the one that minimizes the robust prediction error criterion. 
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7. Simulations. In this section we report on three simulations compar- 
ing the MLE and L2E results. The first two simulations examine the accu- 
racy of estimation. We then follow with a simulation experiment designed to 
examine the variable selection properties. For the first two simulations we 
generated 1000 data sets, with 200 binary outcomes each associated with 4 
covariates, from the logistic model specified by the likelihood in (3.1) with 
parameters /3o = and (3 = (1,0.5,1,2)"'". The covariates Xj were drawn 
from one of two populations. 

^ ^,i^d. iN{fx,0.16Ip) i = l,...,WO 

\n{-h,0.16Ip) i = 101,..., 200 

where p = 4 and /x = (0.25, 0.25, 0.25, 0.25)"^". The responses were generated 
from the model 

y . "-Jd- Bernoulli(F(x7/9), 1). 

7.1. Estimation in Low Dimensions. In the first scenario, we added a 
single outlier, (2/201; X201) where 2/201 = and X201 = {6,6,6,6)'^ and 6 took 
on values in {—0.25,1.5,3,6,12,24}. In words, the 201st point was moved 
in covariate space along the line that runs through the centroids of the 
two subpopulations. In the second scenario, we added a variable number of 
outliers at a single location: {(2/i, Xj)}^2oi' where Ui = and Xj = (3, 3, 3, 3)""" 
for i = 201, . . . ,N and the number of outliers is = 0, 1, 5, 10, 15, 20. For 
each sequence of scenarios described, we performed logistic regression and 
L2E regression. Tables 1 and 2 show the mean and standard deviation for 
the fitted coefficient values in the first and second scenario, respectively. 

The results show two features of the L2E versus the MLE. Consider the 
first scenario. The MLE becomes increasingly biased towards zero as the 
201st point is moved from —0.25 to to 24. In contrast, the L2E is insensitive 
to the placement of the 201st point. Figure 3 shows how \\(3\\2 under each 
estimation procedure varies with the position of outlier is moved. We see that 
MLE values demonstrate "implosion" breakdown, i.e., \\0\\2 tends towards 
as the leverage of the 201st point increases. The L2E estimates do not. 
The second observation is the cost of the L2E's unbiasedness is increased 
variance as seen in the increased standard error in Table 1. The L2E's sample 
standard error is greater than the MLE's for all locations of the outlier. 
Similar behavior is observed in the second scenario. Figure 4 shows that 
"implosion" breakdown ensues as outliers are added. 

7.2. Variable Selection in High Dimensions. In the variable selection ex- 
periment we considered a high dimensional variation on the first scenario. 
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Fig 3. The 2-norm of the regression coefficients (intercept not included) as a function of 
a single outliers position. 




Fig 4. The 2-norm of the regression coefficients (intercept not included) as a function of 
the number of outliers at a fixed position. 
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Table 1 

Varying the location of a single outlier: The estimates calculated by L2 E are essentially 
unbiased regardless of the location of the outlier. In contrast, the MLE results become 
very biased as the outlier position ranges from —0.25 to 24. The unbiasedness of the L2E 
comes at a price of increased variance. The sample standard error of the L2E is greater 
than that of the MLE for all outlier positions. 

MLE L2E 



Outlier Position 


Coefficient 


True Value 


mean 


std 


mean 


std 









-0.002 


0.182 


-0.005 


0.192 




h-'l 


1 


1.032 


0.434 


1.063 


0.480 


-0.25 


3 


0.5 


0.526 


0.424 


0.539 


0.463 






1 


1.047 


0.439 


1.079 


0.482 




/34 


') 


L'.iiO 


().icS7 


2. 181 


0.572 









-0.024 


0.168 


0.002 


0.192 




3, 


1 


0.868 


0.394 


1.052 


0.476 


1.5 


3„ 

l-'2 


0.5 


0.401 


0.391 


0.532 


0.460 




03 


1 


0.880 


0.396 


1.068 


0.478 




04 


2 


1.860 


0.430 


2.160 


0.567 




R 

Po 





-0.022 


0.157 


0.002 


0.192 




3, 


1 


0.732 


0.368 


1.054 


0.476 


3 


3„ 

h'2 


0.5 


0.296 


0.369 


0.533 


0.460 






1 


0.743 


0.368 


1.069 


0.478 




04 


2 


1.662 


0.392 


2.163 


0.567 




00 





-0.020 


0.142 


0.002 


0.192 




01 


1 


0.508 


0.337 


1.054 


0.476 


6 


02 


0.5 


0.112 


0.344 


0.533 


0.460 




03 


1 


0.516 


0.334 


1.069 


0.478 




04 


2 


1.350 


0.347 


2.163 


0.567 




00 





-0.018 


0.128 


0.002 


0.192 




01 


1 


0.153 


0.325 


1.054 


0.476 


12 


02 


0.5 


-0.201 


0.336 


0.533 


0.460 




03 


1 


0.158 


0.316 


1.069 


0.478 




04 


2 


0.906 


0.317 


2.163 


0.567 




00 





-0.011 


0.124 


0.002 


0.192 




01 


1 


-0.088 


0.330 


1.054 


0.476 


24 


02 


0.5 


-0.431 


0.331 


0.533 


0.460 




03 


1 


-0.086 


0.315 


1.069 


0.478 




04 


2 


0.641 


0.324 


2.163 


0.567 



We generated 10 data sets each with n = 500 observations. The covariates 
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Table 2 

Varying the number of outliers at a fixed location: The estimates calculated by L2E are 
essentially unbiased of the number of outliers. In contrast, the MLE results become very 

biased as outliers are added. The unbiasedness of the L2E comes at a price of increased 
variance. The sample standard error of the L2E is greater than that of the MLE for all 

numbers of outliers. 



MLE L2E 



Number of Outliers 


Coefficient 


True Value 


mean 


std 


mean 


std 









0.0049 


0.1824 


0.0021 


0.1923 




h'l 


1 


1.0258 


0.4326 


1.0537 


0.4759 





h'2 


0.5 


0.5213 


0.4225 


0.5327 


0.4599 




03 


1 


1.0405 


n.4376 


1.0690 


0.4782 






2 


2.091)1 


0. m:i 


2.1():!() 


0. ."•,(;(;(; 




R 

Pq 





-0.0221 


0.1573 


0.0021 


0.1923 




3, 


1 


0.7324 


0.3679 


1.0537 


0.4759 


1 


3„ 

/-'2 


0.5 


2Q5fi 


3690 


0.5327 


0.4599 




03 


1 


0.7431 


0.3681 


1.0690 


0.4782 




0, 


2 


1.6620 


0.3924 


2.1629 


0.5666 











0.1258 


0.0021 


0.1923 




3, 


1 


0.0864 


0.3201 


1.0537 


0.4759 


5 


R 
P2 


0.5 


-fl 9698 


0.3272 


0.5327 


0.4599 






1 


0.0905 


0.3082 


1.0690 


0.4782 




04 


2 


0.8300 


0.3125 


2.1629 


0.5666 




00 





-0.1101 


0.1237 


0.0021 


0.1923 




01 


1 


-0.0735 


0.3296 


1.0536 


0.4759 


10 


02 


0.5 


-0.4167 


0.3332 


0.5326 


0.4599 




03 


1 


-0.0709 


0.3153 


1.0690 


0.4782 




04 


2 


0.6586 


0.3226 


2.1628 


0.5667 




00 





-0.1172 


0.1237 


0.0021 


0.1923 




01 


1 


-0.1268 


0.3354 


1.0536 


0.4759 


15 


02 


0.5 


-0.4696 


0.3385 


0.5326 


0.4599 




03 


1 


-0.1245 


0.3208 


1.0689 


0.4782 




04 


2 


0.6048 


0.3282 


2.1627 


0.5667 




00 





-0.1216 


0.1238 


0.0021 


0.1923 




01 


1 


-0.1586 


0.3393 


1.0535 


0.4759 


20 


02 


0.5 


-0.5016 


0.3423 


0.5326 


0.4599 




03 


1 


-0.1566 


0.3246 


1.0689 


0.4782 




04 


2 


0.5735 


0.3318 


2.1626 


0.5668 



were drawn from one of three multivariate normal populations 



i.i.d. 



'iV(/Lt, 0.75 Ip) i 
Ar(-/Lt,0.75Ip) i 
^Ar(iy,0.25Ip) i 



1,...,200 
201, . . . , 400 
401, . . . , 500 
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where p = 500 and 

/x = ( 0.3,...,0.3 ,0,...,q) 

50 450 

u = (1^^^,0^_^). 

50 450 

The responses were generated as follows 



i.i.d. 



fBERNOULLl(F(x7/3),l) i = l,...,400 
1o i = 401,..., 500, 



where /3o = and 



/3 = (1^_^,0^_^) G 

50 450 



[,500 



We then performed Elastic Net penalized regression (a = 0.6) with the 
MLE and L2E. For L2E, we chose the initial starting point according to the 
heuristic described in Section 6.2. To perform model selection we generated 
regularization paths. That is we calculated penalized regression coefficients 
for a range of A values using the robust cross-validation method described 
in Section 6. To perform the elastic net penalized logistic regression we used 
the glmnet package in R (Friedman, Hastie and Tibshirani, 2010). We also 
compared the robust classifier of Wang, Zhu and Zou (2008) - the Hybrid 
Huberized Support Vector Machine (HHSVM) using an MM algorithm. De- 
tails of the implementation can be found in a supplement on the author's 
website.^ Wang, Zhu and Zou (2008) provide code for computing the solu- 
tions paths of the HHSVM, but the algorithm used calculates the paths for 
a varying LASSO regularization parameter with a fixed ridge regularization 
parameter because they can be computed quickly by exploiting the piece- 
wise linearity of the paths under that parameterization of the Elastic Net. 
Our implementation calculates regularization paths using the Elastic Net 
parameterization used in this manuscript. 

Tables 3 and Table 4 show the number of true positives and false positives 
respectively for each method. We see that in scenarios of heavy contamina- 
tion the L2E demonstrates superior sensitivity and specificity compared to 
both the MLE and HHSVM. It is interesting to note that the MLE tends to 
be more sensitive than the HHSVM, but at a cost of being drastically less 
specific. 



^ http://www.stat. lsa.umich.edu/ jizhu/code/hhsvm/ 
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Table 3 

True positive count with n — p = 500 and 50 true covariates. L2E is the most sensitive 
method. HHSVM is the least sensitive method. 





1 


2 


3 


4 


Replicate 
5 6 


7 


8 


9 


10 


MLE 


14 


10 


8 


10 


1 


10 





14 


11 


15 


HHSVM 


1 


3 


2 


2 


1 


2 


1 


2 


4 


2 


L2E 


48 


47 


48 


49 


48 


48 


49 


46 


48 


49 



Table 4 

False positive count with n — p = 500 and 50 true covariates. L2E is the most specific 
method. MLE is the least specific method. 





1 


2 


3 


4 


Replicate 
5 6 


7 


8 


9 


10 


MLE 


141 


95 


56 


148 





141 





128 


136 


170 


HHSVM 





4 


1 


1 


1 





1 











L2E 








2 











1 


1 





1 



Figure 5 shows the robust cross validation curves for the three methods 
for one of the rephcates. Note the large jump in the L2E curve. By choosing 
the starting L2E point by our heuristic, a local minimum different from the 
MLE solution is found. For sufficiently large A, however, the local minimum 
vanishes, and the regularization paths mimic the MLE regularization paths. 

8. Real data examples. 




Fig 5. Robust 10-fold cross-validation curves for the three methods. The green square 
indicates the minimizing X. The orange triangle indicates the 1-MAD rule A. 
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Fig 6. Regularization paths for the three methods. True regression coefficients for true 
covariates are in red. 



8.1. An n > p example: Predicting abnormal and normal vertebral columns. 
We first consider a real data set in the n > p regime. We present results on 
the vertebral column data set from the UCI machine learning repository, as 
described by Frank and Asuncion (2010). The data set consists of 310 pa- 
tients which have been classified as belonging to one of three groups: Normal 
(100 patients), Disk Hernia (60 patients). Spondylolisthesis (150 patients). 
In addition to a classification label, six predictor variables are recorded for 
each patient: pelvic incidence (PI), pelvic tilt (PT), lumbar lordosis angle 
(LLA), sacral slope (SS), pelvic radius (PR) and grade of spondylolisthesis 
(GS). All six predictor variables are continuous valued. 

We consider the two class problem of discriminating normal vertebral 
columns from abnormal ones (Disk Hernia and Spondylolisthesis). Figure 7 
plots the values of individual covariates for each patient. Table 5 shows the 
correlations between pairs of attributes. Note that the attributes for Disk 
Hernia and Normal patients overlap a good deal. We may expect similar re- 
sults as seen in the second simulation scenario described in Section 7.1 where 
Disk Hernia patients play the role of a cluster of outlying observations. Due 
to the correlation, however, the outlying observations are not as distinctly 
outlying as seen in the simulation examples of Section 7.1. Consequently, 
it also might be anticipated that there will not be differences between the 
MLE and L2E regularization paths. Indeed, Figure 8 shows the resulting 
regularization paths generated by the MLE and logistic L2E for a = 0.2. 
The paths are very similar for both methods for other values of a and are not 
shown. Different initial starting points did not change the resulting logistic 
L2E regularization paths. 
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Fig 7. Six biomechanical attributes: pelvic incidence (PI), pelvic tilt (PT), lumbar lordosis 
angle (LLA), sacral slope (SS), pelvic radius (PR) and grade of spondylolisthesis (GS). 
There are three true classes of observations (Disk Hernia, Spondylolisthesis, and Normal). 
Disk Hernia and Spondylolisthesis are lumped into the class Abnormal. 
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PI 


PT 


LLA 


SS 


PR 


GS 


PI 


1.00 


0.63 


0.72 


0.81 


-0.25 


0.64 


PT 


0.63 


1.00 


0.43 


0.06 


0.03 


0.40 


LLA 


0.72 


0.43 


1.00 


0.60 


-0.08 


0.53 


SS 


0.81 


0.06 


0.60 


1.00 


-0.34 


0.52 


PR 


-0.25 


0.03 


-0.08 


-0.34 


1.00 


-0.03 


GS 


0.64 


0.40 


0.53 


0.52 


-0.03 


1.00 


Table 5 



Correlations among the six biomechamcal attributes in the vertebrae data set. 




Fig 8. The regularization (a = 0.2J paths for the MLE and L2E are very similar for the 
six biomechamcal attributes in the vertebrae data set. 

8.2. An n <^ p example: a genome wide association study. We examine 
the lung cancer data of Amos et al. (2008). The purpose of this genome wide 
association study was to identify risk variants for lung cancer. The authors 
employed a two stage study using 315,450 tagging SNPs in 1,154 current and 
former (ever) smokers of European ancestry and 1,137 frequency matched, 
ever-smoking controls from Houston, Texas in the discovery stage. The most 
significant SNPs found in the discovery phases were then tested in a larger 
replication set. Two SNPs, rsl051730 and rs8034191, on chromosome 15 were 
found to be significantly associated with lung cancer risk in the validation 
set. 

In this section we reexamine the discovery data using logistic L2E and the 
logistic MLE. Note that it is current practice of geneticists to do univariate 
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inference with an adjustment for multiple testing and this approach was 
taken in Amos et al. (2008). Taking a multivariate approach as will be done 
in this section, however, allows the analyst to take into account dependencies 
between the SNPs. As an initial comparison we consider a subset of the entire 
data set and restrict our analysis to SNPs on chromosome 15. We impute 
missing genotypes at a SNP by using the MACH 1.0 package, a Markov 
Chain based haplotyper (Li, Ding and Abecasis, 2006). After missing data 
are imputed and keeping only imputations with a quality score of at least 
0.9, 8,701 SNPs are retained on 1152 cases and 1136 controls. 

Figures 9(a) and 9(b) summarize the variable selection results for the 
logistic L2E and MLE for a = 0.05,0.5, and 0.95. Recall from (5.4) that 
small a emphasizes the ridge part of the penalty while large a emphasizes 
the LASSO part of the penalty. 

SNP markers can have a high degree of collinearity due to recombination 
mechanics. SNPs that are physically close to each other tend to be highly 
correlated and are said to be in linkage disequilibrium. The pair rsl051730 
and rs8034191 for example are in "high" linkage disequilibrium. 

There are three things to note. First, the regularization paths for the L2E 
and MLE are almost identical. Second, both methods produce regularization 
paths that identify rsl051730 (orange) and rs8034191 (blue) as having the 
greatest partial correlation the case/control status. Third, the paths for 
rsl051730 and rs8034191 behave as would be expected with a. For small 
a, or more ridge-like penalty, the two paths become more similar. For large 
a, or more LASSO-like penalty, only one of the two correlated predictors 
enters the model while the other is excluded. 

9. Discussion. Outliers can introduce bias in some commonly used 
maximum likelihood estimation procedures. This well known fact, however, 
warrants attention because bias can have material effects on the ubiquitous 
LASSO-based variable selection procedures. In the context of standard logis- 
tic regression, influential outliers cause implosion breakdown. In this paper 
we have demonstrated that the combination of implosion breakdown and 
the soft-thresholding mechanism of LASSO variable selection can result in 
missed detection of relevant predictors. 

To guard against the undue influence of outliers on estimation and vari- 
able selection, we propose a robust method for performing sparse logistic 
regression. Our method is based on minimizing the estimated L2 distance 
between the logistic parametric model and the underlying true conditional 
distribution. The resulting optimization problem is a penalized non-linear 
least squares problem which we solve with an MM algorithm. Our MM 
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0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.0 0.1 0.2 0.3 0.4 0.0 0.2 0.4 0.6 0.8 

^aldf + (1-a)ej/2 

(a) LaE 




0.000 0.005 0.010 0.015 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.2 0.4 0.6 0.8 1.0 

2alBil + (1-a)ej/2 



(b) MLE 

Fig 9. Regularization paths of regression coefficients of SNP markers on Chromosome 15 
for L2E and MLE for a = 0.05, 0.5, and 0.95. The regularization path for rsl051730 is in 
orange; the path for rs8034191 is in blue. The L2E and MLE paths are nearly identical. 
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algorithm in turn reduces the optimization problem into solving a series 
penalized least squares problems whose solution paths can be solved very 
efficiently with coordinate descent and warm starts. 

Although we present our work as a method for robust binary logistic re- 
gression, our method immediately extends to other related contexts. Our 
algorithm can be extended to handle more than two classes. The generaliza- 
tion to the K-class multinomial is straightforward. 

K 

L(Y,X0) = J]||y,-Ffc(X0)||2, 

k=l 

where yik = 1 if the ith observation belongs to class k and otherwise and 
the ith element of vector Fk(X.&) is given by 

exp(x70fc) 
1 + Ef=iexp(x70,)' 

This non-linear least squares problem also has bounded curvature and con- 
sequently can also be solved by minimizing a sequence of LASSO-penalized 
least squares problems. 

Our algorithm can also be used as a subroutine in performing robust bi- 
nary principal component analysis and, more generally, robust binary tensor 
decompositions. A common strategy in array decompositions for multiway 
data, including multiway binary data, is to use block coordinate descent 
or alternating minimization (Chi and Kolda, 2011; Collins, Dasgupta and 
Schapire, 2002; Kolda and Bader, 2009; Lee, Huang and Hu, 2010). For bi- 
nary multiway data, each block minimization would perform a robust logistic 
regression step. 

We want to make clear that the logistic L2E is not a competitor to the 
MLE but rather a complement. Both methods are computationally feasible 
and can be run on data together. As seen in the real data examples of 
Section 8, sometimes the logistic L2E recovers the MLE solution. On the 
other hand, when discrepancies do occur, taking the MLE and L2E solutions 
together can provide insight into the data that would be harder to identify 
with the MLE solution alone. 

We close with some interesting directions for future work. We have seen 
that LASSO-based variable selection in the presence of implosion breakdown 
can lead to missed detection of relevant predictors. This motivates the ques- 
tion of whether explosion breakdown can lead to the inclusion of irrelevant 
predictors. Finally, with respect to convergence issues of our algorithm, while 
we have established conditions under which our algorithm is guaranteed to 
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converge to a stationary point we do not have rigorous results on the rate at 
which it does so. As a complement to methods that may be sensitive to the 
presence of outliers, characterizing the convergence speed of our algorithm 
has a great deal of practical importance. 

The code for both logistic L2E and HHSVM will be made available, and 
the authors hope to receive feedback on its use. 

APPENDIX A: PROOFS 

A.l. Proof of Theorem 5.1. It is immediate that 0) = L(y, X0). 
We turn our attention to proving that L{9;0) > L(y, X0) for all 0,6 G 
W^^'^. Since L(y,X0) has bounded curvature our strategy is to represent 
L(y, X0) by its exact second order Taylor expansion about 6 and then 
find a tight uniform bound over the quadratic term in the expansion. This 
approach applies in general to functions with continuous second derivative 
and bounded curvature (Bohning and Lindsay, 1988). 

The exact second order Taylor expansion of L{y, X0) at is given by 

L(y, Xe) = L(y, XO) + {6- ~e)''VL{y, Xe) + ^{6- 0)''ile^ (0-0), 
where 6* = ^0 + (1 — 7)0 for some 7 G (0, 1) and 



VL{y,Xe) 


= 4n-^X'^G{p-y] 


\ 


He 


= -X"^M0X, 

n 




G 


= diag{pi, ...,pn} 




Me 


= diag{^puApi), • • • , 


(Pn)} 


u 


= 2y-l 




P 


= F{xe) 






= [2p{l - p) - {2p - 


-l)((2p-l) 



Note that (Me)jj is bounded from above, i.e., supg^Q(M.0)ii < 00. We 
now introduce a surrogate function: 

L{e; 6) = L(y, XO) + -(61 - e)'^X'^G{F{Xe) - y) + ^(6> - 9)'^X'^X{e - 6), 

n n 



where 



t] > max < sup il^-i{p), sup "01 (p) } ■ 
lpe[o,i] pe[o,i] I 
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Note that for any 6 G W^'^ , (Me)ii < ry. Therefore, 

{e - eYne-{e -e) = {e- eyyJM.e*y^{e - e) 

and consequently L{0; 6) majorizes L(y, X0) at 9. □ 
The following observations lead to a simpler lower bound on r/. Note that 

sup ip-i{p) = sup -01 (p), 
pe[o,i] pe[o,i] 

since tjj-i{p) = ipi{l — p). So, the lower bound on rj can be more simply 
expressed as 

(A.l) sup = max = ^ max - - 2q^ + q +W . 

pe[o,i] P6[0-1] 4ge[-i,i] [2 2 J 

The first equality follows from the compactness of [0, 1] and the continuity 
of il^i{p)- The second equality follows from reparameterizing ipi{p) in terms 
of g = 2p — 1. Since the derivative of the polynomial in (A.l) has a root at 
1, it is straightforward to argue that the lower bound of r/ is attained at the 
second largest root, which is (—3 + \/33)/12. Thus, the majorization holds 
so long as 



3 4 1 o 1 2 1 1 
^^16^ -4^ -2'^ +4'^+16 



„_ -3+733 



A. 2. Proof of Theorem 5.2. A key condition in MM algorithm con- 
vergence proofs is coerciveness since it is a sufficient condition to ensure the 
existence of a global minimum. Recall that a continuous function f : U C 
— )• M is coercive if all its level sets 5f = {x G [/ : /(x) < t} are compact. 

We will use the MM algorithm global convergence results in Schifano, 
Strawderman and Wells (2010). Let ^(0) denote the objective function and 
let ?['^](6>,0) denote a surrogate objective function that will be minimized 
with respect to its ffist argument in lieu of £,{6). The iteration map if is 
given by 

ip{6) = argmin 

We now state a slightly less general set of regularity conditions than those in 
Schifano, Strawderman and Wells (2010) that are sufficient for our purposes. 
Suppose ^,^^'5']^ and (p satisfy the following set of conditions: 
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Rl. The objective function ^(0) is locally Lipschitz continuous for 6 £ Q 
and coercive. The set of stationary points S of ^(0) is a finite set, 
where the notion of a stationary point is defined as in Clarke (1983). 

R2. ^{6) = ^^^\e,e) for ah G 6. 

R3. ^^^^{e,e) < ^^^\e,e) for all 9,e G G where 6^6. 

R4. is continuous for (6,0) £ Q x Q and locally Lipschitz in Q. 

R5. ip{9) is a singleton set consisting of one bounded vector for 6 £ Q. 

Then {0^"\n > 0} converges to a fixed point of the iteration map ip. By 
Proposition A. 8 in Schifano, Strawderman and Wells (2010) the fixed points 
of ip coincide with S. 

In our case we have the following objective and surrogate functions 

m = ^iiy - nmwl + a (am, + ^-^^ml) 

e[^](0,0) = h{ere) + x(^a\\f3\\i + • 

We check each regularity condition in turn. 

Rl. Since ||y — F(X0)||2 is bounded below and the penalty term is co- 
ercive, ^{0) is coercive. Recall that the gradient of the L(y, X0) is 
(4/n)XTG(F(X0) -y). The norm of the gradient is bounded; specifi- 
cally it is no greater than laf where cri is the largest singular value of 
X. Therefore, L{'y, X0) is Lipschitz continuous and therefore locally 
Lipschitz continuous. Consequently, ^(0) is locally Lipschitz continu- 
ous. If the set of stationary points of ^(0) is finite, then Rl is met. 
R2 and R3. Recall the majorization we are using is given by 

L{0- 6) = L(y, X0) + (6> - ^)^VL(y, X0) + ^(6> - e)'^X.'^X{e - 6), 

n 

where 

To ensure that the majorization is strict we need the inequality to be 
strict. Thus, the curvature of the majorization exceeds the maximum 
curvature of L{y, X0) and the majorization is strict. R2 and R3 are 
met. 

R4. The penalized majorization is the sum of continuous functions in (0, 6) G 
0x0 and is consequently continuous. The penalized majorization as a 
function of its first argument is the sum of a positive definite quadratic 
function and the 1-norm function, both of which are locally Lipschitz 
continuous so their sum is locally Lipschitz continuous. R4 is met. 
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R5. If A(l - a) > then £.^^\9, 6) is strictly convex in and thus has at 
most one global minimizer. Since ^''^1(0,0) is also coercive in 6 it has 
at least one global minimizer. R5 is met. 

Thus, Algorithm 1 will converge to a stationary point of ^,{0), provided that 
there are only finitely many stationary points and the coordinate descent 
minimization of the Elastic Net penalized quadratic majorization is solved 
exactly. □ 

Remark 1. //^ does not have finitely many stationary points, it can be 
shown that the limit points of the sequence of iterates are stationary points 
and that the set of limit points is connected (Chi, 2011; Schifano, Strawder- 
man and Wells, 2010). 

Remark 2. The iterate update = if[6^'^^) can be accomplished 

by any means algorithmically so long as the global minimum of the majoriza- 
tion is found. Iterates of coordinate descent are guaranteed to converge to a 
global minimizer provided that the loss is differ entiable and convex and the 
penalty is convex and separable (Tseng, 2001). Thus, applying coordinate de- 
scent on the Elastic Net penalized quadratic majorization will find the global 
minimum. 

Remark 3. Our definition of stationary points has to change because the 
objective functions of interest are locally Lipschitz continuous and therefore 
differentiable almost everywhere except on a set of Lebesgue measure zero. 
Clarke (1983) defines and proves properties of a generalized gradient for 
locally Lipschitz functions. Apart from pathological cases, when a function 
is convex the generalized gradient is the subdifferential. See Proposition 2.2.7 
in Clarke (1983). When a function is differentiable the generalized gradient 
is the gradient. Thus as would be expected a point x is a stationary point of a 
locally Lipschitz function if the function's generalized gradient at x contains 
0. 
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